1. Introduction

Description of the Project

This project uses the World Bank’s Education Statistics dataset to explore long-term trends and relationships among key indicators of the US education system. Our goal is to understand how structural aspects of education (access, staffing, and outcomes) have evolved over time and how changes in one variable may influence changes in another. We analyze variables across different categories and examine differences between sexes. By focusing on longitudinal, cross-national data, we aim to identify potential patterns and meaningful interactions that could inform educational policy and investment.

The analysis of various indicators allows us to examine temporal changes and potential external factors that might shape educational outcomes.
We selected two major indicator categories that reflect different dimensions of the education system:

  1. Enrollment Rate vs. Attendance Rate
    We will compare enrollment and attendance rates across different levels of schooling (primary, secondary, tertiary) to test whether higher enrollment leads to consistent attendance over time.

  2. Number of Teachers vs. Teaching Staff Compensation
    We will investigate how changes in staffing levels correspond with compensation across educational stages.

Motivation

When we were brainstorming datasets to go with, we had difficulty choosing a topic because the possibilities were endless. We looked at several options from public health to climate to economics, but we eventually narrowed it down to education. This is something that all of us are connected to and that our classmates and readers can easily relate to. We initially considered focusing on Emory-specific data but broadened our scope to the World Bank’s Education Statistics for its larger context and long-term coverage.

As students at an American university, we were especially interested in examining data trends within the United States. Education policy and access here have evolved significantly over time, and exploring long-term patterns in enrollment, attendance, and staffing can reveal whether increased investment in education translates into better participation and equity. This directly relates to us, as college students who have undergone the three main levels of education and were told that our own efforts, most notably in high school, would determine our outcomes later in life. But how much do individual outcomes actually depend on the student, and how much on the school systems and structures that support them?


2. Dataset


3. Setup & Packages

We will perform Exploratory Data Analysis (EDA) in R using the tidyverse suite of libraries along with some supplementary packages for smoother visualization.
Our analysis will include the following components:

Before proceeding, we install and load all necessary packages, then set global options for reproducibility and consistent output formatting. We also create global functions that will streamline table formatting for data outputs of interest.

# install.packages("kableExtra")
# install.packages("tidyverse")
# install.packages("readr")
# install.packages("knitr")
# install.packages("plotly")
# install.packages("rmarkdown")
# install.packages("broom")
# install.packages("ggpubr")
# install.packages("janitor")
library(kableExtra)
library(tidyverse) # core data manipulation & visualization
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.1     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter()     masks stats::filter()
## ✖ dplyr::group_rows() masks kableExtra::group_rows()
## ✖ dplyr::lag()        masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readr) # efficient CSV import
library(knitr)
library(plotly) # interactive data visualization
## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout
library(janitor) # column name cleaning & simple tabulations
## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
library(lubridate) # data parsing and time handling
library(rmarkdown)
library(broom)
library(ggpubr)

#functions to format dts
pretty_r2_table <- function(df, caption_text){
  df %>%
    kable("html", caption = caption_text) %>%
    kable_styling(full_width = F) %>%
    row_spec(0, bold = TRUE) %>%
    column_spec(3, color = ifelse(df$p.value < 0.05, "red", "black"))
}

pretty_cor_table <- function(x, y, var1_name = NULL, var2_name = NULL) {
  cor_res <- cor.test(x, y)
  tibble(
    Variable1 = var1_name,
    Variable2 = var2_name,
    Correlation = round(cor_res$estimate, 3),
    p.value = signif(cor_res$p.value, 3)) %>%
    kable("html", caption = paste0("Correlation: ", var1_name, " vs ", var2_name)) %>%
    kable_styling(full_width = F) %>%
    row_spec(0, bold = TRUE) %>%
    column_spec(4, color = ifelse(cor_res$p.value < 0.05, "red", "black"))
}

set.seed(123)
opts_chunk$set(echo=TRUE, message=FALSE, warning=FALSE)

4. Data Import & Cleaning

This dataset is very large and contains data from various countries. As previously mentioned we are focusing only on the US data. We start from reading the raw EdStats dataset (EdStats_v01.csv) and filter only the United States(Country code == "USA"). We also drop fully empty columns (2024 column) as they will not be needed in our analysis. As a check, print any columns that contain all NA values and proceed only if the null set is output. Then, we write the filtered dataset to EdStats_USA.csv so that later chunks can work only with our US data without filtering for it multiple times.

This simplifies the dataset to the country of interest and removes all-NA columns to keep downstream transformations clean.

#set up data for use
#read data
data <- read_csv("EdStats_v01.csv")

# filter for only USA data, remove NA cols
data_clean <- data %>%
  filter(grepl("USA", `Country code`)) %>%
  select(-`2024`) #all na

#make sure no NA cols remain
colnames(data_clean)[colSums(is.na(data_clean)) == nrow(data_clean)]
## character(0)
#save filtered dataset
write_csv(data_clean, "EdStats_USA.csv")

To keep downstream steps tidy and reproducible, we create three thematic subsets aligned to our research questions:

  1. Enrollment vs. Attendance — indicators with “total net enrolment rate” or “total net attendance rate” (exclude parity indices and sex-specific series for apples-to-apples totals).
    • Note that “enrolment” is misspelled, this is corrected later in code.
    • Dropped all-NA columns and saved as EdStats_attend.csv.
  2. Teachers vs. Compensation — teacher counts and teaching staff compensation (exclude sex splits and qualification percentages).
    • Dropped all-NA columns and saved as EdStats_teacher.csv.

This ensures we work with clean, directly comparable variables.

# 1. Enrollment vs. Attendance
# filter for enrollment/attendance info
data <- read_csv("EdStats_USA.csv")
data_filter_attend <- data %>%
  filter(grepl("total net enrolment rate|total net attendance rate", `Indicator name`, ignore.case = TRUE)) %>%
  filter(!grepl("female|male", `Indicator name`)) %>%
  filter(!grepl("adjusted gender parity index", `Indicator name`, ignore.case = TRUE)) %>%
  select(where(~!all(is.na(.)))) %>% view() %>%
  write_csv("EdStats_attend.csv")

# 2. Teachers vs. Compensation
#filter for num teachers and teaching staff compensation info
data <- read_csv("EdStats_USA.csv")
data_filter_teacher <- data %>%
  filter(grepl("teachers in|teaching staff compensation", `Indicator name`, ignore.case = TRUE)) %>%
  filter(!grepl("female|male", `Indicator name`)) %>%
  filter(!grepl("percentage of qualified", `Indicator name`, ignore.case = TRUE)) %>%
  select(where(~!all(is.na(.)))) %>% view() %>%
  write_csv("EdStats_teacher.csv")

5. Attendance vs. Enrollment — Reshape, Compare, and Quantify

The following chunk expands our analysis of EdStats_attend.csv to quantify and visualize the relationship between school enrollment and attendance across education levels in the United States.
After reshaping the dataset into tidy long form, we standardize indicator names to capture both spellings of “enrolment/enrollment” and collapse duplicate series to obtain a single observation for each year × level × type combination.
From these, we compute a new variable representing the attendance gap (Enrollment – Attendance) to measure how much access (enrollment) translates into sustained participation (attendance).

Reading tip:
Focus on whether attendance consistently trails enrollment and whether that gap changes across levels or decades. High correlation indicates structural progress—both metrics rising together—while a persistent positive gap suggests that access alone does not guarantee participation.

data_attend <- read_csv("EdStats_attend.csv")

#select years with most observations
data_attend_years <- data_attend %>%
  select(!c(`1975`, `1986`, `1987`, `1990`, `1991`, `1993`, `1995`, `1996`, `1999`, `1994`))

#reshape data for visualization
data_long <- data_attend_years %>%
  pivot_longer(cols=`2005`:`2022`, names_to="Year", values_to="Value") %>%
  mutate(
    Year=as.numeric(Year),
    Type=case_when(
      grepl("attendance", `Indicator name`, ignore.case=TRUE) ~ "Attendance",
      grepl("enrolment", `Indicator name`, ignore.case=TRUE) ~ "Enrollment",
      TRUE ~ "Other"
    ),
    Level=case_when(
      grepl("primary", `Indicator name`, ignore.case=TRUE) ~ "Elementary School",
      grepl("lower secondary", `Indicator name`, ignore.case=TRUE) ~ "Middle School",
      grepl("upper secondary", `Indicator name`, ignore.case=TRUE) ~ "High School",
      TRUE ~ "Other"
    )
  ) %>%
  filter(Type!="Other", !is.na(Value))

data_wide <- data_long %>%
  pivot_wider(
    names_from=Type,
    values_from=Value
  )

data_combined <- data_wide %>%
  group_by(Level, Year) %>%
  summarise(
    Attendance = Attendance[!is.na(Attendance)],
    Enrollment = Enrollment[!is.na(Enrollment)]
  ) %>%
  ungroup()

# 2) Collapse duplicates within (Year, Level, Type)
#    (If the catalog has multiple series for the same concept, average them.)
# collapsed <- data_long %>%
#   group_by(Level, Year, Type) %>%
#   summarize(Value = mean(Value, na.rm = TRUE), .groups = "drop")

# 3) Wide format with both columns side-by-side
# gap_df <- collapsed %>%
#   pivot_wider(names_from = Type, values_from = Value) %>%
#   mutate(Gap = Enrollment - Attendance)

# Quick diagnostics (see what exists)
# print(table(gap_df$Level, useNA = "ifany"))
# print(summary(gap_df[, c("Enrollment", "Attendance")]))
# 
# # If there are no overlapping rows, bail out gracefully
# if (nrow(drop_na(gap_df, Enrollment, Attendance)) < 2) {
#   warning("No overlapping Enrollment & Attendance observations after cleaning; skipping correlation and scatter.")
# } else {
#   # Correlation and scatter only when data exist
#   corr_df <- gap_df %>% drop_na(Enrollment, Attendance)
# 
#   r_val <- cor(corr_df$Enrollment, corr_df$Attendance, use = "pairwise.complete.obs", method = "pearson")
#   message(sprintf("Correlation between Enrollment and Attendance: r = %.2f", r_val))
# 
#   p_scatter <- ggplot(corr_df, aes(Enrollment, Attendance, color = Level)) +
#     geom_point(alpha = 0.6) +
#     geom_smooth(method = "lm", se = FALSE, linewidth = 0.9) +
#     labs(
#       title = sprintf("Attendance vs. Enrollment (r = %.2f)", r_val),
#       x = "Enrollment Rate (%)", y = "Attendance Rate (%)"
#     ) +
#     theme_minimal(base_size = 12)
#   print(p_scatter)
# }

model <- lm(Enrollment ~ Attendance, data=data_combined)
# summary(model)  #not significant

by_year_r2 <- data_combined %>%
    group_by(Year) %>%
    do(glance(lm(Enrollment ~ Attendance, data=.))) %>%
    select(Year,r.squared,p.value)
# by_year_r2  #not significant
pretty_r2_table(by_year_r2, "R² and p-values by Year (Red = significant, Black = not significant)")
R² and p-values by Year (Red = significant, Black = not significant)
Year r.squared p.value
2015 0.9997303 0.0104555
2016 0.9659338 0.1181787
2017 0.9330262 0.1666495
2020 0.3520319 0.5956316
2021 0.0093777 0.9382539
by_schooling_r2 <- data_combined %>%
    group_by(Level) %>%
    do(glance(lm(Enrollment ~ Attendance, data=.))) %>%
    select(Level,r.squared,p.value)
# by_schooling_r2  #not significant
pretty_r2_table(by_schooling_r2, "R² and p-values by School Level (Red = significant, Black = not significant)")
R² and p-values by School Level (Red = significant, Black = not significant)
Level r.squared p.value
Elementary School 0.1355655 0.5420218
High School 0.2899115 0.3491842
Middle School 0.0677148 0.6724550
###
p_line_per_school_horizontal <- ggplot(data_long, aes(x=Year, y=Value, color=Type)) +
  geom_line(size=1.2, na.rm=TRUE) +
  geom_point(size=2, alpha=0.8) +
  facet_grid(~Level, scales="fixed", switch="y") +
  theme_minimal() +
  theme(
    strip.background=element_rect(fill="grey90", color=NA),
    panel.spacing=unit(1, "lines")
  ) +
  labs(
    title="Attendance vs. Enrollment Trends in Schools Over Time (USA)",
    y="Rate (%)",
    color="Variable"
  ) +
  scale_color_manual(values=c("Attendance"="#FF69B4", "Enrollment"="#3bbf8f")
  )
# ggsave("line_per_school_horizontal.png", p_line_per_school_horizontal)

p_line_interactive1 <- ggplotly(p_line_per_school_horizontal)
p_line_interactive1
# 5) Gap over time (only where both exist)
# if (any(!is.na(gap_df$Gap))) {
#   p_gap <- ggplot(gap_df, aes(Year, Gap, color = Level)) +
#     geom_hline(yintercept = 0, linetype = "dashed") +
#     geom_line(linewidth = 1, na.rm = TRUE) +
#     geom_point(size = 1.3, alpha = 0.8) +
#     labs(title = "Enrollment – Attendance Gap (percentage points)", y = "Gap (pp)", x = "Year") +
#     theme_minimal(base_size = 12)
#   print(p_gap)
# }

###

eqn_text <- paste0(
  "Trendline: y = ",
  round(coef(model)[2],3),"x + ",
  round(coef(model)[1],3),
  "<br>R² = ",round(summary(model)$r.squared,3),
  "<br>p = ",signif(summary(model)$coefficients[2,4],3)
)

enroll_attend_scatter <- ggplot(data_combined, aes(x=Attendance, y=Enrollment)) +
  geom_point(aes(color=factor(Year)), size=2) +
  geom_smooth(method=lm, aes(group=1, text=eqn_text)) +
  theme_minimal() +
  labs(title="All Attendance vs Enrollment Observations",
       x="Attendance",
       y="Enrollment",
       color="Year")

enroll_attend_scatter_plotly <- ggplotly(enroll_attend_scatter, tooltip=c("text","x","y","color"))
enroll_attend_scatter_plotly
####

enroll_longitudinal <- ggplot(data_combined, aes(x=Year, y=Enrollment, color=Level)) +
  geom_point(size=2) +
  geom_smooth(method=lm, size=1, alpha=0.25) +
  theme_minimal() +
  labs(title=" Enrollment Observations Longitudinally",
       x="Year",
       y="Entrollemnt",
       color="Level")
enroll_longitudinal_plotly <- ggplotly(enroll_longitudinal)
enroll_longitudinal_plotly
pretty_cor_table(data_combined$Year, data_combined$Enrollment, "Year", "Enrollment")
Correlation: Year vs Enrollment
Variable1 Variable2 Correlation p.value
Year Enrollment 0.099 0.725
###

attend_longitudinal <- ggplot(data_combined, aes(x=Year, y=Attendance, color=Level)) +
  geom_point(size=2) +
  geom_smooth(method=lm, size=1, alpha=0.25) +
  theme_minimal() +
  labs(title=" Attendance Observations Longitudinally",
       x="Year",
       y="Attendance",
       color="Level")
attend_longitudinal_plotly <- ggplotly(attend_longitudinal)
attend_longitudinal_plotly
pretty_cor_table(data_combined$Year, data_combined$Attendance, "Year", "Attendance")
Correlation: Year vs Attendance
Variable1 Variable2 Correlation p.value
Year Attendance -0.594 0.0196
###

# 6) Summary stats (use collapsed to avoid duplicate inflation)
summary_stats <- data_long %>%
  group_by(Type, Level) %>%
  summarise(
    n = n(),                           
    Mean = mean(Value, na.rm=TRUE),
    Median = median(Value, na.rm=TRUE),
    SD = sd(Value, na.rm=TRUE),
    Variance = var(Value, na.rm=TRUE),
    Minimum = min(Value, na.rm=TRUE),
    Maximum = max(Value, na.rm=TRUE),
    Range = Maximum - Minimum,
    Q1 = quantile(Value, 0.25, na.rm=TRUE),
    Q3 = quantile(Value, 0.75, na.rm=TRUE),
    IQR = Q3 - Q1,
  ) %>%
  arrange(Type, Level)  

paged_table(summary_stats)
write_csv(summary_stats, "tendency_attend.csv")

6. Teachers vs. Compensation — Reshape, Compare, and Quantify

This section examines how teacher workforce size and compensation have evolved in the United States from 2014 to 2021.
Using EdStats_teacher.csv, we identify indicators relating to the number of teachers and the share of educational expenditure allocated to teaching staff compensation.
After filtering out non-teaching staff and qualification percentages, the data are reshaped into tidy long form to create two new analytical variables:

Teacher counts are scaled per 10 000 to make them visually comparable to percentages.
We then:

Reading tip:
Focus on whether the compensation share rises or falls with the number of teachers.
Parallel upward trends may indicate sustained investment in instructional personnel, while divergence could suggest funding reallocation, changing class sizes, or reporting gaps across education levels.

### This is not done


# 1) Read & filter (keep only teaching staff; drop qualification % series)
teacher_raw <- read_csv("EdStats_teacher.csv", show_col_types = FALSE) %>%
  filter(!str_detect(`Indicator name`, regex("non-?teaching staff", ignore_case = TRUE))) %>%
  filter(!str_detect(`Indicator name`, regex("percentage of qualified", ignore_case = TRUE)))

# 2) Long format (auto-detect year columns) + standardized labels
teachers_long0 <- teacher_raw %>%
  pivot_longer(cols = matches("^\\d{4}$"), names_to = "Year", values_to = "Value") %>%
  mutate(
    Year = as.numeric(Year),
    Type = case_when(
      str_detect(`Indicator name`, regex("compensation", ignore_case = TRUE)) ~
        "Compensation % of Total Expenditure",
      str_detect(`Indicator name`, regex("teachers? in|number of teachers", ignore_case = TRUE)) ~
        "Number of Teachers",
      TRUE ~ NA_character_
    ),
    Level = case_when(
      str_detect(`Indicator name`, regex("\\bpre-?primary\\b", ignore_case = TRUE)) ~ "Pre-Primary",
      str_detect(`Indicator name`, regex("\\bprimary\\b", ignore_case = TRUE)) ~ "Primary",
      str_detect(`Indicator name`, regex("lower\\s*secondary", ignore_case = TRUE)) ~ "Lower Secondary",
      str_detect(`Indicator name`, regex("upper\\s*secondary", ignore_case = TRUE)) ~ "Upper Secondary",
      str_detect(`Indicator name`, regex("\\bsecondary\\b", ignore_case = TRUE)) ~ "Secondary (unspecified)",
      str_detect(`Indicator name`, regex("\\btertiary\\b|post-?secondary|higher education", ignore_case = TRUE)) ~ "Tertiary",
      TRUE ~ NA_character_
    )
  ) %>%
  filter(!is.na(Type), !is.na(Level), !is.na(Value))

# 3) Collapse duplicates within (Year, Level, Type) to ensure 1 row per cell
teachers_long <- teachers_long0 %>%
  group_by(Level, Year, Type) %>%
  summarise(Value = mean(Value, na.rm = TRUE), .groups = "drop")

# 4) Scale teacher counts to 10,000s for visual comparability with percentages
teachers_long <- teachers_long %>%
  mutate(Value_scaled = if_else(Type == "Number of Teachers", Value / 10000, Value))

# (Optional) order levels for consistent faceting
lvl_order <- c("Pre-Primary", "Primary", "Lower Secondary", "Upper Secondary",
               "Secondary (unspecified)", "Tertiary")
teachers_long <- teachers_long %>%
  mutate(Level = factor(Level, levels = intersect(lvl_order, unique(Level))))

# ---- A. Time-series trends by level (scaled) ----
p_line_teacher <- ggplot(teachers_long, aes(x = Year, y = Value_scaled, color = Type)) +
  geom_line(linewidth = 1.1, na.rm = TRUE) +
  geom_point(size = 1.7, alpha = 0.8) +
  facet_wrap(~ Level, ncol = 3, scales = "fixed") +
  theme_minimal(base_size = 12) +
  theme(strip.background = element_rect(fill = "grey95", color = NA)) +
  scale_color_manual(
    values = c("Number of Teachers" = "#FF69B4",
               "Compensation % of Total Expenditure" = "#3BBF8F"),
    labels = c("Number of Teachers (per 10,000)", "Compensation (% of total)")
  ) +
  labs(
    title = "Teachers & Compensation Trends by Level (USA)",
    y = "Scaled value (Teachers per 10,000; Compensation %)",
    color = "Series"
  ) +
  scale_x_continuous(breaks = pretty(unique(teachers_long$Year)))
p_line_teacher

# Interactive view (kept optional for HTML)
ggplotly(p_line_teacher)
# ---- B. Coverage diagnostics (counts by Type / Year / Level) ----
p_count_type <- ggplot(teachers_long, aes(x = Type)) +
  geom_bar(fill = "#c562f0") +
  theme_minimal(base_size = 12) +
  labs(title = "Coverage: Count of Observations by Type", x = "Series", y = "Count")
p_count_type

p_count_year <- ggplot(teachers_long, aes(x = factor(Year))) +
  geom_bar(fill = "#c562f0") +
  theme_minimal(base_size = 12) +
  labs(title = "Coverage: Observations per Year (all levels)", x = "Year", y = "Count")
p_count_year

p_count_level <- ggplot(teachers_long, aes(x = Level, fill = Type)) +
  geom_bar(position = "dodge") +
  theme_minimal(base_size = 12) +
  labs(title = "Coverage: Observations by Level and Type", x = "Level", y = "Count", fill = "Series")
p_count_level

# ---- C. Correlation between staffing and compensation (where both exist) ----
# Wide table on the *scaled* values so axes are comparable
wide_tc <- teachers_long %>%
  select(Level, Year, Type, Value_scaled) %>%
  pivot_wider(names_from = Type, values_from = Value_scaled)

# overall (pooled) correlation with guard
corr_tbl <- tibble::tibble()

valid_overall <- wide_tc %>%
  mutate(valid = complete.cases(`Number of Teachers`, `Compensation % of Total Expenditure`)) %>%
  filter(valid)

if (nrow(valid_overall) >= 3) {
  overall_r <- cor(valid_overall$`Number of Teachers`,
                   valid_overall$`Compensation % of Total Expenditure`,
                   use = "complete.obs", method = "pearson")
  corr_tbl <- bind_rows(corr_tbl, tibble(Level = "All levels (pooled)", r = overall_r))
}

# by-level correlations (no cur_data(); compute within each group)
by_level_r <- wide_tc %>%
  group_by(Level) %>%
  summarise(
    n_pairs = sum(complete.cases(`Number of Teachers`, `Compensation % of Total Expenditure`)),
    r = ifelse(
      n_pairs >= 3,
      cor(`Number of Teachers`, `Compensation % of Total Expenditure`,
          use = "complete.obs", method = "pearson"),
      NA_real_
    ),
    .groups = "drop"
  ) %>%
  filter(!is.na(r))

corr_tbl <- bind_rows(corr_tbl, by_level_r %>% select(Level, r))
corr_tbl
## # A tibble: 6 × 2
##   Level                        r
##   <chr>                    <dbl>
## 1 All levels (pooled)     -0.313
## 2 Pre-Primary             -0.571
## 3 Primary                  0.551
## 4 Lower Secondary         -0.830
## 5 Upper Secondary         -0.862
## 6 Secondary (unspecified) -0.334
# ---- D. Summary statistics (per Type × Level) ----
summary_stats_teachers <- teachers_long %>%
  group_by(Type, Level) %>%
  summarise(
    n = n(),
    mean_value = mean(Value, na.rm = TRUE),
    median_value = median(Value, na.rm = TRUE),
    sd_value = sd(Value, na.rm = TRUE),
    min_value = min(Value, na.rm = TRUE),
    max_value = max(Value, na.rm = TRUE),
    IQR = IQR(Value, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(Type, Level)

write_csv(summary_stats_teachers, "tendency_teachers.csv")
summary_stats_teachers
## # A tibble: 12 × 9
##    Type         Level     n mean_value median_value sd_value min_value max_value
##    <chr>        <fct> <int>      <dbl>        <dbl>    <dbl>     <dbl>     <dbl>
##  1 Compensatio… Pre-…     6       49.0         49.0  6.83e-1      48.1      49.8
##  2 Compensatio… Prim…     6       49.0         49.0  6.83e-1      48.1      49.8
##  3 Compensatio… Lowe…     6       49.0         49.0  6.83e-1      48.1      49.8
##  4 Compensatio… Uppe…     6       49.0         49.0  6.83e-1      48.1      49.8
##  5 Compensatio… Seco…     6       46.8         46.7  9.30e-1      46.1      48.7
##  6 Compensatio… Tert…     6       28.5         28.3  7.90e-1      27.6      30.0
##  7 Number of T… Pre-…    12   531226.      608720.   1.40e+5  302255    651997. 
##  8 Number of T… Prim…    17  1604827.     1687937    1.49e+5 1414000   1774348. 
##  9 Number of T… Lowe…    14   810580.      858595.   8.81e+4  670172    894725. 
## 10 Number of T… Uppe…    14   774670.      813548.   7.75e+4  650603    848440. 
## 11 Number of T… Seco…    15  1501112.     1661375    2.95e+5  784414.  1737206. 
## 12 Number of T… Tert…    22   853184.      747500    3.14e+5  574000   1581424  
## # ℹ 1 more variable: IQR <dbl>

7. Discussion

Interim Takeaways

  • Enrollment vs. Attendance: Attendance rates consistently trail enrollment rates at all levels, though the gap varies by level and year. This may point to structural barriers beyond access.
  • Teachers vs. Compensation: Preliminary trends suggest whether compensation shares grow with staffing. Divergences could signal reallocations or coverage differences.

These observations are preliminary and will guide further analysis.

Reproducibility Notes
All intermediate datasets and figures are written to disk (EdStats_USA.csv, EdStats_attend.csv, EdStats_teacher.csv, and exported PNGs).
This allows others to review and re-run later steps without repeating earlier processing.
Consider adding sessionInfo() at the end to document package versions.


8. Works Cited

reference string

reference string

reference string